
Store Sales - Forecasting by Linear Regression¶
We will go deeper into these sections:
- Interpolation for Oil Prices
- Exploratory Data Analysis
- Modelling
The objective is to predict sales using the test_data. This data is provided by Corporación Favorita, a major grocery retailer based in Ecuador. The company also has operations in several other South American countries.


The dataset includes 54 stores and 33 product families. The time series spans from 2013-01-01 to 2017-08-31. Kaggle has divided the data into two parts: train and test. The test data covers a 15-day period following the end date of the training data. This date range in the test data is crucial for defining a cross-validation strategy and generating new features.
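Since the test window is exactly 15 days, a natural validation scheme carves matching 15-day windows off the end of the training range. A minimal sketch of that idea (the boundary dates come from the competition description; the three-fold rolling scheme itself is my assumption, not something the data prescribes):

```python
import pandas as pd

train_end = pd.Timestamp("2017-08-15")  # last date in the training data
horizon = pd.Timedelta(days=15)         # length of the Kaggle test window

# Three back-to-back 15-day validation windows, newest first
folds = []
for k in range(3):
    val_end = train_end - k * horizon
    val_start = val_end - horizon + pd.Timedelta(days=1)
    folds.append((val_start, val_end))

for start, end in folds:
    print(f"validate on {start.date()} .. {end.date()}")
```

Each fold mimics the real forecasting task: train on everything before `val_start`, score on the 15 days that follow.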
We will examine five datasets step by step:
- Train
- Test
- Store
- Transactions
- Daily Oil Price
Train data includes time series data for combinations of stores and product families. The sales column indicates the total sales for a product family at a particular store on a specific date. Fractional sales are possible due to the sale of items in fractional quantities (e.g., 1.5 kg of cheese vs. 1 bag of chips). The onpromotion column shows the total number of items in a product family being promoted at a store on a given date.
Store data provides information about each store, including city, state, type, and cluster.
Transaction data is highly correlated with the sales column in the train dataset, allowing us to understand sales patterns at the stores.
Daily Oil Price data is another valuable dataset. Given Ecuador's dependence on oil, its economic health is particularly sensitive to fluctuations in oil prices. This data will help us determine how oil prices impact different product families.
The "Additional Notes" section in the data description may reveal significant patterns or anomalies. Here are some notes to keep in mind:¶
- Public sector wages are paid bi-weekly on the 15th and the last day of the month, which could influence supermarket sales.
- A magnitude 7.8 earthquake struck Ecuador on April 16, 2016, prompting relief efforts that included donations of essential goods. This event significantly affected supermarket sales for several weeks following the earthquake.
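These two notes translate directly into calendar features. A minimal sketch of a payday flag and an earthquake-relief window flag (the column names and toy date range are illustrative, and the six-week relief window is my assumption about "several weeks"):

```python
import pandas as pd

dates = pd.Series(pd.date_range("2016-04-10", "2016-05-02"))

# 1 on the 15th and on the last calendar day of each month, else 0
is_payday = ((dates.dt.day == 15) | dates.dt.is_month_end).astype(int)

# 1 during the assumed relief period after the 2016-04-16 earthquake
quake_relief = ((dates >= "2016-04-16") & (dates <= "2016-05-31")).astype(int)

print(dates[is_payday == 1].dt.date.tolist())
```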
Let's get started!
1. Importing Important Packages¶
import numpy as np
import pandas as pd
import os
import gc
import warnings
import statsmodels.api as sm # for PACF/ACF plots, if needed
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
# Configurations
pd.set_option('display.max_columns', None)
pd.options.display.float_format = '{:.2f}'.format
warnings.filterwarnings('ignore')
2. Importing the Datasets¶
# Import
train = pd.read_csv(r'C:\Users\Deni Rifqi\Downloads\train.csv')
test = pd.read_csv(r'C:\Users\Deni Rifqi\Downloads\test.csv')
stores = pd.read_csv(r'C:\Users\Deni Rifqi\Downloads\stores.csv')
#sub = pd.read_csv(r'C:\Users\Deni Rifqi\Downloads\sub.csv')
transactions = pd.read_csv(r'C:\Users\Deni Rifqi\Downloads\transactions.csv').sort_values(["store_nbr", "date"])
# Datetime
train["date"] = pd.to_datetime(train.date)
test["date"] = pd.to_datetime(test.date)
transactions["date"] = pd.to_datetime(transactions.date)
# Data types
train.onpromotion = train.onpromotion.astype("float16")
train.sales = train.sales.astype("float32")
stores.cluster = stores.cluster.astype("int8")
train.head()
|  | id | date | store_nbr | family | sales | onpromotion |
|---|---|---|---|---|---|---|
| 0 | 0 | 2013-01-01 | 1 | AUTOMOTIVE | 0.00 | 0.00 |
| 1 | 1 | 2013-01-01 | 1 | BABY CARE | 0.00 | 0.00 |
| 2 | 2 | 2013-01-01 | 1 | BEAUTY | 0.00 | 0.00 |
| 3 | 3 | 2013-01-01 | 1 | BEVERAGES | 0.00 | 0.00 |
| 4 | 4 | 2013-01-01 | 1 | BOOKS | 0.00 | 0.00 |
3. Transactions¶
A deep dive into the transactions dataset.
transactions.head(15)
|  | date | store_nbr | transactions |
|---|---|---|---|
| 1 | 2013-01-02 | 1 | 2111 |
| 47 | 2013-01-03 | 1 | 1833 |
| 93 | 2013-01-04 | 1 | 1863 |
| 139 | 2013-01-05 | 1 | 1509 |
| 185 | 2013-01-06 | 1 | 520 |
| 231 | 2013-01-07 | 1 | 1807 |
| 277 | 2013-01-08 | 1 | 1869 |
| 323 | 2013-01-09 | 1 | 1910 |
| 369 | 2013-01-10 | 1 | 1679 |
| 415 | 2013-01-11 | 1 | 1813 |
| 461 | 2013-01-12 | 1 | 1473 |
| 507 | 2013-01-13 | 1 | 542 |
| 553 | 2013-01-14 | 1 | 1780 |
| 599 | 2013-01-15 | 1 | 1680 |
| 645 | 2013-01-16 | 1 | 1933 |
This feature shows a strong correlation with sales, but initially, we need to aggregate the sales data to identify the relationship. Transactions refer to the number of customers visiting the store or the number of invoices issued in a day.
The sales data represents the total revenue for a product family at a specific store on a given date. Since products can be sold in fractional quantities (e.g., 1.5 kg of cheese instead of just one bag of chips), fractional sales values are possible.
Therefore, transactions will be a key feature in the model. In the subsequent sections, we will create new features based on transaction data.
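One way to turn transactions into model features, as promised above, is lags and rolling means per store. A minimal sketch on toy data (column names like `tx_lag_1` are illustrative, not from the competition files):

```python
import pandas as pd

tx = pd.DataFrame({
    "store_nbr": [1, 1, 1, 1, 2, 2, 2, 2],
    "date": pd.to_datetime(["2013-01-02", "2013-01-03", "2013-01-04", "2013-01-05"] * 2),
    "transactions": [2111, 1833, 1863, 1509, 950, 1020, 980, 1100],
}).sort_values(["store_nbr", "date"])

g = tx.groupby("store_nbr")["transactions"]
tx["tx_lag_1"] = g.shift(1)  # yesterday's transactions, per store
# 3-day rolling mean of past values; the shift keeps today's value out (no leakage)
tx["tx_roll_3"] = g.transform(lambda s: s.shift(1).rolling(3).mean())

print(tx)
```

Grouping by `store_nbr` before shifting makes the lag reset at each store boundary instead of bleeding across stores.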
Transactions per Store¶
temp = pd.merge(train.groupby(["date", "store_nbr"]).sales.sum().reset_index(), transactions, how = "left")
print("Spearman Correlation between Total Sales and Transactions: {:,.5f}".format(temp[["sales", "transactions"]].corr("spearman").sales.loc["transactions"]))
px.line(transactions.sort_values(["store_nbr", "date"]), x='date', y='transactions', color='store_nbr', title = "Transactions per Store")
Spearman Correlation between Total Sales and Transactions: 0.81746
Transactions Boxplot¶
a = transactions.copy()
a["year"] = a.date.dt.year
a["month"] = a.date.dt.month
px.box(a, x="year", y="transactions" , color = "month", title = "Transactions Boxplot")
Transactions exhibit a consistent pattern, with all months showing similar trends except for December, as observed from 2013 to 2017 in the boxplot. Additionally, the previous plot revealed the same pattern across all stores, where sales consistently increased towards the end of the year.
Let's look at transactions using monthly averages!
We have identified a pattern that boosts sales, which occurs at the end of the year. Additionally, transactions tend to rise during the spring season and decline afterward.
import plotly.express as px
# Processing Data
a = transactions.set_index("date").resample("M").transactions.mean().reset_index()
a["year"] = a.date.dt.year
# Build the chart
fig = px.line(
a,
x='date',
y='transactions',
color='year',
title="Monthly Average Transactions",
labels={'transactions': 'Average Transactions', 'date': 'Date'},
line_shape='linear', # straight line segments
markers=True, # show a marker at each data point
template='plotly_white', # white background theme
color_discrete_sequence=px.colors.qualitative.Set1, # discrete color palette
)
# Add layout and design features
fig.update_layout(
title={
'text': "Monthly Average Transactions Over Time",
'y':0.95,
'x':0.5,
'xanchor': 'center',
'yanchor': 'top',
},
xaxis_title='Date',
yaxis_title='Average Transactions',
legend_title_text='Year',
font=dict(
family="Arial, sans-serif",
size=14,
color="black"
),
hovermode="x unified",
margin=dict(l=50, r=50, t=80, b=50),
plot_bgcolor='rgba(240,240,240,0.8)', # light gray plot background
paper_bgcolor='white', # white paper background
)
# Add grid lines
fig.update_xaxes(showgrid=True, gridwidth=1, gridcolor='lightgray')
fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='lightgray')
fig.show()
Looking at their relationship directly, we again see a strong correlation between total sales and transactions.
px.scatter(temp, x = "transactions", y = "sales", trendline = "ols", trendline_color_override = "red")
The days of the week play a crucial role in shopping behavior, revealing a significant pattern. Stores experience higher transaction volumes on weekends, with Saturday being the peak day for shopping. This trend remains consistent from 2013 to 2017.
import plotly.express as px
# Copy the data and add year and day-of-week columns
a = transactions.copy()
a["year"] = a.date.dt.year
a["dayofweek"] = a.date.dt.dayofweek + 1 # add 1 so days run from 1 (Monday) to 7 (Sunday)
# Average transactions per year and day of week
a = a.groupby(["year", "dayofweek"]).transactions.mean().reset_index()
# Build the line chart
fig = px.line(
a,
x="dayofweek",
y="transactions",
color="year",
title="Average Transactions by Day of the Week",
labels={
"dayofweek": "Day of the Week",
"transactions": "Average Transactions",
"year": "Year"
},
markers=True, # show a marker at each data point
line_shape='linear', # straight line segments
template="plotly_white", # white background theme
color_discrete_sequence=px.colors.qualitative.Plotly, # discrete color palette
)
# Add layout and design features
fig.update_layout(
title={
'text': "Average Transactions by Day of the Week",
'y':0.95,
'x':0.5,
'xanchor': 'center',
'yanchor': 'top',
},
xaxis_title='Day of the Week',
yaxis_title='Average Transactions',
font=dict(
family="Arial, sans-serif",
size=14,
color="black"
),
hovermode="x unified",
margin=dict(l=50, r=50, t=80, b=50),
plot_bgcolor='rgba(240,240,240,0.8)', # light gray plot background
paper_bgcolor='white', # white paper background
)
# Make the x-axis ticks and grid clearer
fig.update_xaxes(
tickmode='array',
tickvals=[1, 2, 3, 4, 5, 6, 7],
ticktext=['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'],
showgrid=True,
gridwidth=1,
gridcolor='lightgray'
)
fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='lightgray')
fig.show()
4. Oil Price Impact¶
The economy poses a significant challenge for both governments and individuals, impacting various aspects positively or negatively. In this context, Ecuador, being reliant on oil, sees its economy fluctuate with changes in oil prices. To gain a deeper understanding, I examined Ecuador's economic situation and found a relevant article from the IMF. Reading it is recommended if you aim to enhance your models using oil data.
As shown below, there are some missing data points in the daily oil price dataset. Various imputation methods could be used to handle these gaps, but I opted for a straightforward approach. Linear interpolation is well-suited for this time series data. By examining a time series plot of oil prices, you can observe the trend and estimate the missing data points.
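To see what linear interpolation does on a short series with gaps (the numbers here are toy values, not the real oil quotes), each missing point is filled on the straight line between its known neighbors:

```python
import numpy as np
import pandas as pd

# Toy daily series with a two-day gap, mimicking missing oil quotes
s = pd.Series([93.14, np.nan, np.nan, 92.97, 93.12],
              index=pd.date_range("2013-01-01", periods=5))

filled = s.interpolate(method="linear")
print(filled)
```

The two NaNs are replaced by equal steps from 93.14 down to 92.97, which is exactly the behavior we want for short gaps in a slowly moving price series.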
# Importing Oil Dataset
oil = pd.read_csv(r'C:\Users\Deni Rifqi\Downloads\oil.csv')
oil["date"] = pd.to_datetime(oil.date)
# Resample to a daily series (days with no quote get sum 0)
oil = oil.set_index("date").dcoilwtico.resample("D").sum().reset_index()
# Interpolate
oil["dcoilwtico"] = np.where(oil["dcoilwtico"] == 0, np.nan, oil["dcoilwtico"])
oil["dcoilwtico_interpolated"] = oil.dcoilwtico.interpolate()
# Melt Data
p = oil.melt(id_vars=['date'], var_name='Legend')
# Plot
fig = px.line(
p.sort_values(["Legend", "date"], ascending=[False, True]),
x='date',
y='value',
color='Legend',
title="Daily Oil Price",
width=1000, # plot width in pixels
height=600 # plot height in pixels
)
# Show the plot
fig.show()
Is it accurate to say that Ecuador relies heavily on oil? Can this dependency be observed from the data?
To address this, let's first examine the correlations of oil prices with sales and transactions. Although the correlation values are not very strong, there is a negative sign for sales, which might provide some insight. Intuitively, if the daily oil price is high, Ecuador's economy is likely struggling, leading to higher product prices and lower sales. This indicates a negative relationship.
temp = pd.merge(temp, oil, how = "left")
print("Correlation with Daily Oil Prices")
print(temp[["sales", "transactions", "dcoilwtico_interpolated"]].corr("spearman").dcoilwtico_interpolated.loc[["sales", "transactions"]], "\n")
fig, axes = plt.subplots(1, 2, figsize = (15,5))
temp.plot.scatter(x = "dcoilwtico_interpolated", y = "transactions", ax=axes[0], color = 'y')
temp.plot.scatter(x = "dcoilwtico_interpolated", y = "sales", ax=axes[1], color = "g")
axes[0].set_title('Daily oil price & Transactions', fontsize = 15)
axes[1].set_title('Daily Oil Price & Sales', fontsize = 15);
Correlation with Daily Oil Prices
sales          -0.30
transactions    0.04
Name: dcoilwtico_interpolated, dtype: float64
We should never base our decisions solely on a single graph or result! It's crucial to reassess our perspective and formulate new hypotheses.
If we had relied on simple outputs, like those above, and concluded that there was no relationship with oil prices, we might have mistakenly decided not to use oil price data.
Now that we understand the importance of thorough analysis, let's create a scatter plot, but this time focused on the different product families. Most plots reveal a similar pattern: when the daily oil price is below roughly 70, sales tend to be higher. We can observe two distinct clusters here: one above 70 and one below. This distinction is quite clear.
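That eyeballed threshold of about 70 can be encoded as a simple regime flag. A minimal sketch (the threshold value and column name are assumptions drawn from the scatter plots, not from the data description):

```python
import pandas as pd

prices = pd.Series([45.3, 68.9, 70.0, 71.2, 102.5])  # toy oil prices

# Binary regime flag: 1 when oil is strictly below the eyeballed 70 threshold
oil_below_70 = (prices < 70).astype(int)
print(oil_below_70.tolist())  # [1, 1, 0, 0, 0]
```

A feature like this lets a linear model give the low-oil regime its own intercept instead of forcing one slope across both clusters.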
5. Sales¶
The primary goal is to predict store sales for each product family. To achieve this, the sales column must be analyzed in depth. We need to understand various aspects, including seasonality, trends, anomalies, and any correlations with other time series data.
When we analyze the stores using a correlation matrix, we find that most of them are quite similar. However, stores such as 20, 21, 22, and 52 appear to be somewhat different.
# Data preparation
a = train[["store_nbr", "sales"]].copy()
a["ind"] = 1
a["ind"] = a.groupby("store_nbr").ind.cumsum().values
a = pd.pivot(a, index="ind", columns="store_nbr", values="sales").corr()
mask = np.triu(a)  # a is already the correlation matrix
# Plot
plt.figure(figsize=(20, 20))
ax = sns.heatmap(
a,
annot=True,
fmt='.1f',
cmap='coolwarm', # diverging color palette
square=True,
mask=mask,
linewidths=1,
cbar=True # show the colorbar
)
# Add a label to the colorbar and set font sizes
colorbar = ax.collections[0].colorbar
colorbar.set_label('Correlation Value', fontsize=15)
colorbar.ax.tick_params(labelsize=10) # colorbar tick label size
# Manually size the colorbar
colorbar.ax.set_aspect(20) # colorbar aspect ratio
plt.subplots_adjust(right=0.85) # leave room for the colorbar
plt.title("Correlations among Stores", fontsize=20)
plt.show()
The graph below shows the daily total sales of each store.
# Data preparation
a = train.set_index("date").groupby("store_nbr").resample("D").sales.sum().reset_index()
# Plot
fig = px.line(
a,
x="date",
y="sales",
color="store_nbr",
title="Daily Total Sales of the Stores",
width=1000, # plot width in pixels
height=600 # plot height in pixels
)
# Show the plot
fig.show()
Some rows in the store time series data are unnecessary. By examining the data for each store individually, it becomes clear that certain stores, specifically stores 20, 21, 22, 29, 36, 42, 52, and 53, show no sales at the beginning of 2013. To address this, we will remove the rows for these stores before their opening dates. The following code will handle the removal of these rows.
print(train.shape)
train = train[~((train.store_nbr == 52) & (train.date < "2017-04-20"))]
train = train[~((train.store_nbr == 22) & (train.date < "2015-10-09"))]
train = train[~((train.store_nbr == 42) & (train.date < "2015-08-21"))]
train = train[~((train.store_nbr == 21) & (train.date < "2015-07-24"))]
train = train[~((train.store_nbr == 29) & (train.date < "2015-03-20"))]
train = train[~((train.store_nbr == 20) & (train.date < "2015-02-13"))]
train = train[~((train.store_nbr == 53) & (train.date < "2014-05-29"))]
train = train[~((train.store_nbr == 36) & (train.date < "2013-05-09"))]
train.shape
(3000888, 6)
(2780316, 6)
# To know features in train and test data
print(train.columns)
print(test.columns)
Index(['id', 'date', 'store_nbr', 'family', 'sales', 'onpromotion'], dtype='object')
Index(['id', 'date', 'store_nbr', 'family', 'onpromotion'], dtype='object')
# Add a 'sales' column to the test data
test['sales'] = 0
# Modelling
# Predict 'sales' in the test data from 'onpromotion' and 'store_nbr'
from sklearn.linear_model import LinearRegression
# Suppose we use only the 'onpromotion' and 'store_nbr' columns of train for prediction
X_train = train[['onpromotion', 'store_nbr']]
y_train = train['sales']
X_test = test[['onpromotion', 'store_nbr']]
# Create and fit the model
model = LinearRegression()
model.fit(X_train, y_train)
# Predict 'sales' for the test DataFrame
test['sales'] = model.predict(X_test)
Predicted 'Sales' Value in test_data¶
print(test[['date', 'onpromotion', 'store_nbr', 'sales']].head(100))
          date  onpromotion  store_nbr   sales
0   2017-08-16            0          1  185.84
1   2017-08-16            0          1  185.84
2   2017-08-16            2          1  262.17
3   2017-08-16           20          1  949.14
4   2017-08-16            0          1  185.84
..         ...          ...        ...     ...
95  2017-08-16            0         11  221.78
96  2017-08-16          163         11 6442.74
97  2017-08-16           14         11  756.09
98  2017-08-16            0         11  221.78
99  2017-08-16            0         12  225.37

[100 rows x 4 columns]
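The model above is fit on the full training set and inspected only by eye. The competition scores submissions with Root Mean Squared Logarithmic Error, so a sensible next step is to score a time-based holdout with that metric. A minimal sketch (the helper name and toy arrays are mine; the metric definition follows the standard RMSLE formula):

```python
import numpy as np

def rmsle(y_true, y_pred):
    # Root Mean Squared Logarithmic Error; negative predictions clipped to 0
    y_pred = np.clip(y_pred, 0, None)
    return np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))

y_true = np.array([0.0, 10.0, 100.0])  # toy actual sales
y_pred = np.array([0.0, 12.0, 90.0])   # toy predicted sales
print(round(rmsle(y_true, y_pred), 4))
```

The `log1p` makes the metric tolerate zeros (common in this dataset) and penalize relative rather than absolute errors.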
6. Holidays and Events¶
What a mess! You are probably confused by the holidays and events data. It packs in a lot of information, but don't worry. Just take a breath and think! It is metadata, so you have to split it logically and make it useful.
What are our problems?
- Some national holidays have been transferred.
- There can be several holidays on the same day. When we merge all of the data, the number of rows could increase, and we don't want duplicates.
- What is the scope of a holiday? It can be national, regional, or local, so we need to split them by scope.
- The work-day issue
- Some specific events
- Creating new features, etc.
By the end of this section, they won't be a problem anymore!
# Read the holidays dataset
holidays = pd.read_csv(r'C:\Users\Deni Rifqi\Downloads\holidays_events.csv')
holidays["date"] = pd.to_datetime(holidays.date)
# Separate transferred holidays and transfer type holidays
tr1 = holidays[(holidays.type == "Holiday") & (holidays.transferred == True)].drop("transferred", axis=1).reset_index(drop=True)
tr2 = holidays[holidays.type == "Transfer"].drop("transferred", axis=1).reset_index(drop=True)
tr = pd.concat([tr1, tr2], axis=1)
tr = tr.iloc[:, [5, 1, 2, 3, 4]]
# Filter out non-transferred holidays and non-transfer type holidays
holidays = holidays[(holidays.transferred == False) & (holidays.type != "Transfer")].drop("transferred", axis=1)
# Concatenate the dataframes and reset index
holidays = pd.concat([holidays, tr], ignore_index=True)
# Display the resulting DataFrame
print(holidays.head())
        date     type    locale locale_name                    description
0 2012-03-02  Holiday     Local       Manta             Fundacion de Manta
1 2012-04-01  Holiday  Regional    Cotopaxi  Provincializacion de Cotopaxi
2 2012-04-12  Holiday     Local      Cuenca            Fundacion de Cuenca
3 2012-04-14  Holiday     Local    Libertad      Cantonizacion de Libertad
4 2012-04-21  Holiday     Local    Riobamba      Cantonizacion de Riobamba
import gc
# Read and preprocess holidays dataset
holidays = pd.read_csv(r'C:\Users\Deni Rifqi\Downloads\holidays_events.csv')
holidays["date"] = pd.to_datetime(holidays.date)
# Separate transferred holidays and transfer type holidays
tr1 = holidays[(holidays.type == "Holiday") & (holidays.transferred == True)].drop("transferred", axis=1).reset_index(drop=True)
tr2 = holidays[holidays.type == "Transfer"].drop("transferred", axis=1).reset_index(drop=True)
tr = pd.concat([tr1, tr2], axis=1)
tr = tr.iloc[:, [5, 1, 2, 3, 4]]
# Filter out non-transferred holidays and non-transfer type holidays
holidays = holidays[(holidays.transferred == False) & (holidays.type != "Transfer")].drop("transferred", axis=1)
# Concatenate the DataFrames using pd.concat() instead of append
holidays = pd.concat([holidays, tr], ignore_index=True)
# Additional Holidays
holidays["description"] = holidays["description"].str.replace("-", "", regex=False).str.replace("+", "", regex=False).str.replace(r"\d+", "", regex=True)
holidays["type"] = np.where(holidays["type"] == "Additional", "Holiday", holidays["type"])
# Bridge Holidays
holidays["description"] = holidays["description"].str.replace("Puente ", "")
holidays["type"] = np.where(holidays["type"] == "Bridge", "Holiday", holidays["type"])
# Work Day Holidays, meant to payback the Bridge.
work_day = holidays[holidays.type == "Work Day"]
holidays = holidays[holidays.type != "Work Day"]
# Split the holidays data into different categories
events = holidays[holidays.type == "Event"].drop(["type", "locale", "locale_name"], axis=1).rename({"description": "events"}, axis=1)
holidays = holidays[holidays.type != "Event"].drop("type", axis=1)
regional = holidays[holidays.locale == "Regional"].rename({"locale_name": "state", "description": "holiday_regional"}, axis=1).drop("locale", axis=1).drop_duplicates()
national = holidays[holidays.locale == "National"].rename({"description": "holiday_national"}, axis=1).drop(["locale", "locale_name"], axis=1).drop_duplicates()
local = holidays[holidays.locale == "Local"].rename({"description": "holiday_local", "locale_name": "city"}, axis=1).drop("locale", axis=1).drop_duplicates()
# Merge with main data
d = pd.merge(pd.concat([train, test]), stores)
d["store_nbr"] = d["store_nbr"].astype("int8")
# Merge national holidays & events
d = pd.merge(d, national, how="left")
d = pd.merge(d, regional, how="left", on=["date", "state"])
d = pd.merge(d, local, how="left", on=["date", "city"])
# Merge work day information
d = pd.merge(d, work_day[["date", "type"]].rename({"type": "IsWorkDay"}, axis=1), how="left")
# Process events
events["events"] = np.where(events.events.str.contains("futbol"), "Futbol", events.events)
def one_hot_encoder(df, nan_as_category=True):
    original_columns = list(df.columns)
    categorical_columns = df.select_dtypes(["category", "object"]).columns.tolist()
    df = pd.get_dummies(df, columns=categorical_columns, dummy_na=nan_as_category)
    df.columns = df.columns.str.replace(" ", "_")
    # Return only the newly created dummy columns, after the space replacement
    new_columns = [c for c in df.columns if c not in original_columns]
    return df, new_columns
events, events_cat = one_hot_encoder(events, nan_as_category=False)
events["events_Dia_de_la_Madre"] = np.where(events.date == "2016-05-08", 1, events["events_Dia_de_la_Madre"])
events = events.drop(239)
d = pd.merge(d, events, how="left")
d[events_cat] = d[events_cat].fillna(0)
# Create new features
d["holiday_national_binary"] = np.where(d.holiday_national.notnull(), 1, 0)
d["holiday_local_binary"] = np.where(d.holiday_local.notnull(), 1, 0)
d["holiday_regional_binary"] = np.where(d.holiday_regional.notnull(), 1, 0)
d["national_independence"] = np.where(d.holiday_national.isin(['Batalla de Pichincha', 'Independencia de Cuenca', 'Independencia de Guayaquil', 'Primer Grito de Independencia']), 1, 0)
d["local_cantonizacio"] = np.where(d.holiday_local.str.contains("Cantonizacio"), 1, 0)
d["local_fundacion"] = np.where(d.holiday_local.str.contains("Fundacion"), 1, 0)
d["local_independencia"] = np.where(d.holiday_local.str.contains("Independencia"), 1, 0)
holidays, holidays_cat = one_hot_encoder(d[["holiday_national", "holiday_regional", "holiday_local"]], nan_as_category=False)
d = pd.concat([d.drop(["holiday_national", "holiday_regional", "holiday_local"], axis=1), holidays], axis=1)
he_cols = d.columns[d.columns.str.startswith("events")].tolist() + d.columns[d.columns.str.startswith("holiday")].tolist() + d.columns[d.columns.str.startswith("national")].tolist() + d.columns[d.columns.str.startswith("local")].tolist()
d[he_cols] = d[he_cols].astype("int8")
d[["family", "city", "state", "type"]] = d[["family", "city", "state", "type"]].astype("category")
# Clean up
del holidays, holidays_cat, work_day, local, regional, national, events, events_cat, tr, tr1, tr2, he_cols
gc.collect()
d.head(10)
(Output truncated: the first 10 rows of `d`, all for store 1 on 2013-01-01 with zero sales, now carry the store metadata (Quito, Pichincha, type D, cluster 13) together with roughly a hundred one-hot columns: the `events_*` dummies (including the many `events_Terremoto_Manabi*` variants), the `holiday_*_binary` flags, `national_independence`, the `local_*` flags, and the `holiday_national_*`, `holiday_regional_*`, and `holiday_local_*` dummies. For these rows the national-holiday flags mark New Year's Day.)
Let's perform an AB test on the Events and Holidays features to determine their statistical significance. This could also serve as an initial method for feature selection.
- Null Hypothesis (H0): Sales are equal (M1 = M2)
- Alternative Hypothesis (H1): Sales are not equal (M1 ≠ M2)
def AB_Test(dataframe, group, target):
    # Packages
    import numpy as np
    import pandas as pd
    from scipy.stats import shapiro
    import scipy.stats as stats
    # Split A/B
    groupA = dataframe[dataframe[group] == 1][target]
    groupB = dataframe[dataframe[group] == 0][target]
    # Assumption: Normality
    # H0: Distribution is Normal - False
    # H1: Distribution is not Normal - True
    ntA = shapiro(groupA)[1] < 0.05
    ntB = shapiro(groupB)[1] < 0.05
    if (ntA == False) & (ntB == False):  # H0: Normal Distribution
        # Parametric Test
        # Assumption: Homogeneity of variances
        # H0: Variances are homogeneous - False
        # H1: Variances are heterogeneous - True
        leveneTest = stats.levene(groupA, groupB)[1] < 0.05
        if leveneTest == False:
            # Homogeneous variances: standard t-test
            ttest = stats.ttest_ind(groupA, groupB, equal_var=True)[1]
        else:
            # Heterogeneous variances: Welch's t-test
            ttest = stats.ttest_ind(groupA, groupB, equal_var=False)[1]
    else:
        # Non-Parametric Test
        ttest = stats.mannwhitneyu(groupA, groupB)[1]
    # H0: M1 == M2 - False
    # H1: M1 != M2 - True
    # Result
    temp = pd.DataFrame({
        "AB Hypothesis": [ttest < 0.05],
        "p-value": [ttest]
    })
    temp["Test Type"] = np.where((ntA == False) & (ntB == False), "Parametric", "Non-Parametric")
    temp["AB Hypothesis"] = np.where(temp["AB Hypothesis"] == False, "Fail to Reject H0", "Reject H0")
    temp["Comment"] = np.where(temp["AB Hypothesis"] == "Fail to Reject H0", "A/B groups are similar!", "A/B groups are not similar!")
    temp["Feature"] = group
    temp["GroupA_mean"] = groupA.mean()
    temp["GroupB_mean"] = groupB.mean()
    temp["GroupA_median"] = groupA.median()
    temp["GroupB_median"] = groupB.median()
    # Columns
    if (ntA == False) & (ntB == False):
        temp["Homogeneity"] = np.where(leveneTest == False, "Yes", "No")
        temp = temp[["Feature", "Test Type", "Homogeneity", "AB Hypothesis", "p-value", "Comment", "GroupA_mean", "GroupB_mean", "GroupA_median", "GroupB_median"]]
    else:
        temp = temp[["Feature", "Test Type", "AB Hypothesis", "p-value", "Comment", "GroupA_mean", "GroupB_mean", "GroupA_median", "GroupB_median"]]
    return temp
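Before applying the function to the real features, the decision flow can be sanity-checked on synthetic data. The sketch below mirrors the normality → variance-homogeneity → test-choice logic of `AB_Test`; the group sizes, means, and seed are arbitrary choices for illustration only.

```python
import numpy as np
from scipy.stats import shapiro, levene, ttest_ind, mannwhitneyu

rng = np.random.default_rng(0)
# Synthetic "sales": group A (event days) is shifted up relative to group B
groupA = rng.normal(loc=400, scale=50, size=200)
groupB = rng.normal(loc=380, scale=50, size=2000)

# Mirror the AB_Test decision flow: normality first, then variance homogeneity
normal = (shapiro(groupA).pvalue >= 0.05) and (shapiro(groupB).pvalue >= 0.05)
if normal:
    equal_var = levene(groupA, groupB).pvalue >= 0.05
    p = ttest_ind(groupA, groupB, equal_var=equal_var).pvalue
    test_type = "Parametric"
else:
    p = mannwhitneyu(groupA, groupB).pvalue
    test_type = "Non-Parametric"

print(test_type, "Reject H0" if p < 0.05 else "Fail to Reject H0")
```

With a true mean difference of 20 against a standard deviation of 50 and these sample sizes, both branches should comfortably reject H0.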
# Apply A/B Testing
he_cols = (d.columns[d.columns.str.startswith("events")].tolist()
           + d.columns[d.columns.str.startswith("holiday")].tolist()
           + d.columns[d.columns.str.startswith("national")].tolist()
           + d.columns[d.columns.str.startswith("local")].tolist())
ab = []
for i in he_cols:
    ab.append(AB_Test(dataframe=d[d.sales.notnull()], group=i, target="sales"))
ab = pd.concat(ab)
ab
| | Feature | Test Type | AB Hypothesis | p-value | Comment | GroupA_mean | GroupB_mean | GroupA_median | GroupB_median |
|---|---|---|---|---|---|---|---|---|---|
| 0 | events_Black_Friday | Non-Parametric | Reject H0 | 0.00 | A/B groups are not similar! | 382.24 | 387.89 | 23.00 | 17.00 |
| 0 | events_Cyber_Monday | Non-Parametric | Reject H0 | 0.00 | A/B groups are not similar! | 458.88 | 387.75 | 21.00 | 17.00 |
| 0 | events_Dia_de_la_Madre | Non-Parametric | Reject H0 | 0.00 | A/B groups are not similar! | 374.92 | 387.91 | 14.00 | 17.00 |
| 0 | events_Futbol | Non-Parametric | Reject H0 | 0.00 | A/B groups are not similar! | 349.84 | 388.18 | 11.00 | 17.00 |
| 0 | events_Terremoto_Manabi | Non-Parametric | Reject H0 | 0.00 | A/B groups are not similar! | 492.92 | 387.81 | 34.00 | 17.00 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 0 | holiday_local_Independencia_de_Latacunga | Non-Parametric | Reject H0 | 0.00 | A/B groups are not similar! | 170.47 | 387.90 | 8.50 | 17.00 |
| 0 | national_independence | Non-Parametric | Reject H0 | 0.00 | A/B groups are not similar! | 443.59 | 387.28 | 20.00 | 17.00 |
| 0 | local_cantonizacio | Non-Parametric | Reject H0 | 0.00 | A/B groups are not similar! | 387.52 | 488.00 | 17.00 | 23.00 |
| 0 | local_fundacion | Non-Parametric | Reject H0 | 0.00 | A/B groups are not similar! | 387.99 | 239.87 | 17.00 | 11.00 |
| 0 | local_independencia | Non-Parametric | Reject H0 | 0.00 | A/B groups are not similar! | 387.57 | 467.87 | 17.00 | 22.00 |
87 rows × 9 columns
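Note that with 87 hypotheses tested at α = 0.05, a few "Reject H0" results are expected by chance alone. If stricter feature screening is desired, a family-wise correction such as Holm–Bonferroni could be applied to the collected p-values. A minimal sketch on hypothetical p-values (in practice they would come from the `p-value` column of `ab`):

```python
import numpy as np

# Hypothetical p-values standing in for the `p-value` column of `ab`
pvals = np.array([0.001, 0.02, 0.04, 0.30, 0.0004])
alpha = 0.05
m = len(pvals)

# Holm-Bonferroni: step down through sorted p-values with shrinking thresholds
order = np.argsort(pvals)
reject = np.zeros(m, dtype=bool)
for k, idx in enumerate(order):
    if pvals[idx] <= alpha / (m - k):
        reject[idx] = True
    else:
        break  # once one comparison fails, all larger p-values fail as well

print(reject)  # -> [ True False False False  True]
```

Here only the two smallest p-values survive the correction; with m = 87 tests the per-step thresholds start near 0.0006, so many of the raw rejections above might not survive.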
# Check if 'd' is defined
if 'd' in locals():
    print("DataFrame 'd' is defined and ready for use.")
else:
    print("DataFrame 'd' is not defined. Please define it before running the A/B tests.")
DataFrame 'd' is defined and ready for use.
# Average sales for different product categories (family) during periods with and without a "Black Friday" event (events_Black_Friday).
d.groupby(["family","events_Black_Friday"]).sales.mean()[:20]
family events_Black_Friday
AUTOMOTIVE 0 9.39
1 5.89
BABY CARE 0 2.98
1 0.08
BEAUTY 0 7.14
1 3.52
BEVERAGES 0 2562.21
1 2766.75
BOOKS 0 2.93
1 0.23
BREAD/BAKERY 0 502.55
1 422.58
CELEBRATION 0 11.80
1 15.84
CLEANING 0 1155.80
1 982.76
DAIRY 0 769.87
1 684.00
DELI 0 291.73
1 265.42
Name: sales, dtype: float64
Note, for instance, that average CELEBRATION sales rise significantly to 15.84 during Black Friday.
Summary:¶
The results indicate how average sales for various product categories change during the Black Friday period compared to normal periods. For some categories like "BEVERAGES" and "CELEBRATION," sales increase during Black Friday, possibly due to promotions or increased consumer spending during this time. However, other categories like "AUTOMOTIVE" and "CLEANING" see a decrease in average sales during Black Friday, which could suggest that customers prioritize different products during this event.
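The event/non-event comparison above can be turned into a relative effect size by unstacking the grouped means and computing the percentage change per family. A small sketch on toy data using the same column names (`family`, `events_Black_Friday`, `sales`); with the real `d`, only the first assignment would change.

```python
import pandas as pd

# Toy frame with the same columns as `d` (the numbers are illustrative only)
toy = pd.DataFrame({
    "family": ["BEVERAGES"] * 4 + ["AUTOMOTIVE"] * 4,
    "events_Black_Friday": [0, 0, 1, 1] * 2,
    "sales": [2500, 2600, 2700, 2830, 10, 9, 6, 6],
})

# Mean sales per family with (1) and without (0) the event, side by side
means = toy.groupby(["family", "events_Black_Friday"]).sales.mean().unstack()
# Relative Black Friday effect per family, in percent
means["pct_change"] = (means[1] - means[0]) / means[0] * 100
print(means.round(2))
```

A positive `pct_change` flags families that benefit from the event, a negative one flags families that customers deprioritize, which is a compact way to rank the patterns described above.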
7. Simple Moving Average¶
a = train.sort_values(["store_nbr", "family", "date"])
for i in [20, 30, 45, 60, 90, 120, 365, 730]:
    for lag in [16, 30, 60]:
        # Rolling mean computed and shifted within each store/family series,
        # so windows and lags never bleed across group boundaries
        a["SMA" + str(i) + "_sales_lag" + str(lag)] = a.groupby(["store_nbr", "family"]).sales.transform(
            lambda x: x.rolling(i).mean().shift(lag))
print("Correlation")
a[["sales"] + a.columns[a.columns.str.startswith("SMA")].tolist()].corr()
Correlation
| | sales | SMA20_sales_lag16 | SMA20_sales_lag30 | SMA20_sales_lag60 | SMA30_sales_lag16 | SMA30_sales_lag30 | SMA30_sales_lag60 | SMA45_sales_lag16 | SMA45_sales_lag30 | SMA45_sales_lag60 | SMA60_sales_lag16 | SMA60_sales_lag30 | SMA60_sales_lag60 | SMA90_sales_lag16 | SMA90_sales_lag30 | SMA90_sales_lag60 | SMA120_sales_lag16 | SMA120_sales_lag30 | SMA120_sales_lag60 | SMA365_sales_lag16 | SMA365_sales_lag30 | SMA365_sales_lag60 | SMA730_sales_lag16 | SMA730_sales_lag30 | SMA730_sales_lag60 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| sales | 1.00 | 0.91 | 0.89 | 0.87 | 0.91 | 0.89 | 0.87 | 0.91 | 0.90 | 0.87 | 0.91 | 0.89 | 0.87 | 0.91 | 0.89 | 0.87 | 0.90 | 0.89 | 0.86 | 0.90 | 0.89 | 0.87 | 0.91 | 0.89 | 0.86 |
| SMA20_sales_lag16 | 0.91 | 1.00 | 0.99 | 0.95 | 1.00 | 0.98 | 0.95 | 0.99 | 0.98 | 0.95 | 0.99 | 0.98 | 0.95 | 0.99 | 0.98 | 0.95 | 0.98 | 0.97 | 0.95 | 0.97 | 0.97 | 0.95 | 0.98 | 0.98 | 0.95 |
| SMA20_sales_lag30 | 0.89 | 0.99 | 1.00 | 0.96 | 0.99 | 1.00 | 0.96 | 1.00 | 0.99 | 0.97 | 0.99 | 0.99 | 0.97 | 0.99 | 0.99 | 0.96 | 0.99 | 0.98 | 0.96 | 0.97 | 0.97 | 0.96 | 0.98 | 0.98 | 0.96 |
| SMA20_sales_lag60 | 0.87 | 0.95 | 0.96 | 1.00 | 0.96 | 0.98 | 1.00 | 0.98 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.98 | 0.98 | 0.97 | 0.97 | 0.98 | 0.98 | 0.98 |
| SMA30_sales_lag16 | 0.91 | 1.00 | 0.99 | 0.96 | 1.00 | 0.99 | 0.96 | 1.00 | 0.99 | 0.96 | 0.99 | 0.99 | 0.96 | 0.99 | 0.98 | 0.96 | 0.99 | 0.98 | 0.96 | 0.98 | 0.97 | 0.96 | 0.98 | 0.98 | 0.96 |
| SMA30_sales_lag30 | 0.89 | 0.98 | 1.00 | 0.98 | 0.99 | 1.00 | 0.98 | 1.00 | 1.00 | 0.98 | 1.00 | 0.99 | 0.98 | 0.99 | 0.99 | 0.97 | 0.99 | 0.99 | 0.97 | 0.98 | 0.98 | 0.97 | 0.98 | 0.98 | 0.98 |
| SMA30_sales_lag60 | 0.87 | 0.95 | 0.96 | 1.00 | 0.96 | 0.98 | 1.00 | 0.98 | 0.99 | 1.00 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 |
| SMA45_sales_lag16 | 0.91 | 0.99 | 1.00 | 0.98 | 1.00 | 1.00 | 0.98 | 1.00 | 1.00 | 0.98 | 1.00 | 0.99 | 0.98 | 0.99 | 0.99 | 0.98 | 0.99 | 0.99 | 0.97 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 |
| SMA45_sales_lag30 | 0.90 | 0.98 | 0.99 | 0.99 | 0.99 | 1.00 | 0.99 | 1.00 | 1.00 | 0.99 | 1.00 | 1.00 | 0.99 | 1.00 | 0.99 | 0.98 | 0.99 | 0.99 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 |
| SMA45_sales_lag60 | 0.87 | 0.95 | 0.97 | 0.99 | 0.96 | 0.98 | 1.00 | 0.98 | 0.99 | 1.00 | 0.99 | 0.99 | 1.00 | 0.99 | 1.00 | 0.99 | 0.99 | 1.00 | 0.99 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 |
| SMA60_sales_lag16 | 0.91 | 0.99 | 0.99 | 0.99 | 0.99 | 1.00 | 0.99 | 1.00 | 1.00 | 0.99 | 1.00 | 1.00 | 0.99 | 1.00 | 0.99 | 0.98 | 0.99 | 0.99 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 |
| SMA60_sales_lag30 | 0.89 | 0.98 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 1.00 | 0.99 | 1.00 | 1.00 | 0.99 | 1.00 | 1.00 | 0.99 | 1.00 | 0.99 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 |
| SMA60_sales_lag60 | 0.87 | 0.95 | 0.97 | 0.99 | 0.96 | 0.98 | 0.99 | 0.98 | 0.99 | 1.00 | 0.99 | 0.99 | 1.00 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 0.99 | 0.99 | 0.99 | 0.98 | 0.98 | 0.98 | 0.98 |
| SMA90_sales_lag16 | 0.91 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 1.00 | 0.99 | 1.00 | 1.00 | 0.99 | 1.00 | 1.00 | 0.99 | 1.00 | 1.00 | 0.99 | 0.99 | 0.99 | 0.98 | 0.99 | 0.98 | 0.98 |
| SMA90_sales_lag30 | 0.89 | 0.98 | 0.99 | 0.99 | 0.98 | 0.99 | 0.99 | 0.99 | 0.99 | 1.00 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 0.99 | 1.00 | 1.00 | 0.99 | 0.99 | 0.99 | 0.98 | 0.98 | 0.99 | 0.98 |
| SMA90_sales_lag60 | 0.87 | 0.95 | 0.96 | 0.99 | 0.96 | 0.97 | 0.99 | 0.98 | 0.98 | 0.99 | 0.98 | 0.99 | 1.00 | 0.99 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 0.99 | 0.99 | 0.99 | 0.98 | 0.98 | 0.99 |
| SMA120_sales_lag16 | 0.90 | 0.98 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 |
| SMA120_sales_lag30 | 0.89 | 0.97 | 0.98 | 0.99 | 0.98 | 0.99 | 0.99 | 0.99 | 0.99 | 1.00 | 0.99 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 |
| SMA120_sales_lag60 | 0.86 | 0.95 | 0.96 | 0.98 | 0.96 | 0.97 | 0.99 | 0.97 | 0.98 | 0.99 | 0.98 | 0.98 | 0.99 | 0.99 | 0.99 | 1.00 | 0.99 | 1.00 | 1.00 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 |
| SMA365_sales_lag16 | 0.90 | 0.97 | 0.97 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| SMA365_sales_lag30 | 0.89 | 0.97 | 0.97 | 0.97 | 0.97 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| SMA365_sales_lag60 | 0.87 | 0.95 | 0.96 | 0.97 | 0.96 | 0.97 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.99 | 0.99 | 0.99 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| SMA730_sales_lag16 | 0.91 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.99 | 0.98 | 0.98 | 0.99 | 0.99 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| SMA730_sales_lag30 | 0.89 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.99 | 0.98 | 0.99 | 0.99 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| SMA730_sales_lag60 | 0.86 | 0.95 | 0.96 | 0.98 | 0.96 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.99 | 0.99 | 0.99 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
Summary of the Correlation Results:¶
The correlation matrix displays the relationship between the original sales and various lagged Simple Moving Averages (SMA) of different window sizes. Here's a summary of the key points:
Strong Correlation with Sales:
- The original `sales` variable has a strong positive correlation with all the lagged SMA features, with correlations ranging from 0.86 to 0.91. This indicates that past sales data, averaged over different periods, is highly predictive of current sales.

Correlation Among SMA Features:
- The SMA features are also highly correlated with each other. For example, the correlation between `SMA20_sales_lag16` and `SMA20_sales_lag30` is 0.99. This high correlation suggests that different lag periods for the same window size produce similar trends in the data.

Decreasing Correlation with Longer Lags:
- As the lag increases (e.g., from 16 days to 60 days), the correlation between the SMA features and `sales` slightly decreases. This trend is visible across all window sizes, showing that more recent data is generally more predictive.

Window Size Influence:
- The correlation between `sales` and the SMA features also slightly decreases as the window size increases (e.g., from 20 to 730 days). However, the correlation remains strong across all window sizes, indicating that both short-term and long-term trends are relevant to predicting sales.

Overall Patterns:
- The matrix suggests that any of these SMA features could be useful predictors in a forecasting model. However, the selection of specific lags and window sizes may depend on the desired balance between capturing short-term versus long-term trends in the data.
Implications for Modeling:¶
- The strong correlations indicate that the SMA features will likely be valuable in a sales forecasting model. However, the high multicollinearity among these features may require dimensionality reduction techniques, such as PCA, or the careful selection of non-redundant features to avoid overfitting in predictive models.
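One lightweight alternative to PCA is to drop one feature from each highly correlated pair using the upper triangle of the absolute correlation matrix. A sketch on synthetic data with a hypothetical 0.95 threshold (two near-duplicate rolling means of the same random walk plus one genuinely independent column):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
walk = pd.Series(rng.normal(size=500).cumsum())

# Two near-duplicate smoothed features plus one independent column
X = pd.DataFrame({
    "SMA_a": walk.rolling(20).mean(),
    "SMA_b": walk.rolling(22).mean(),
    "noise": rng.normal(size=500),
}).dropna()

# Upper triangle of the absolute correlation matrix (each pair counted once)
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
# Drop any feature that is > 0.95 correlated with an earlier-kept feature
to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]
print(to_drop)
```

Applied to the SMA features above, where many pairwise correlations sit at 0.98–1.00, this would retain only a handful of representative windows and lags.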
b = a[a.store_nbr == 5].set_index("date")
windows = [20, 30, 45, 60, 90, 120, 365, 730]
for i in b.family.unique():
    fig, ax = plt.subplots(2, 4, figsize=(20, 10))
    for j, w in enumerate(windows):
        axis = ax[j // 4, j % 4]
        b[b.family == i][["sales", "SMA" + str(w) + "_sales_lag16"]].plot(legend=True, ax=axis, linewidth=4)
        axis.legend(fontsize="x-large")
    plt.suptitle("STORE 5 - " + i, fontsize=15)  # store_nbr == 5 is selected above
    plt.tight_layout(pad=1.5)
    plt.show()